
Full text of "Mathematical Contributions to the Theory of Evolution.--On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs"



Mathematical Contributions to the Theory of Evolution. 489 

was always less than the standard error when only $x_2$ was taken into
account, unless

$$\rho_{13} = 0.$$

We may now prove the similar theorem that when we use three
variables, $x_2$, $x_3$, $x_4$, on which to base the estimate, the standard error
will be again decreased, unless

$$\rho_{14} = 0.$$

The condition that $S(v^2)$, in our present case, shall be less than
$S(v^2)$ in the last, is, in fact,

$$\left\{ r_{12}^2 + r_{13}^2 + r_{14}^2 - r_{12}^2 r_{34}^2 - r_{13}^2 r_{24}^2 - r_{14}^2 r_{23}^2 - 2\,(r_{12} r_{13} r_{23} + r_{12} r_{14} r_{24} + r_{13} r_{14} r_{34}) + 2\,(r_{12} r_{13} r_{24} r_{34} + r_{12} r_{14} r_{23} r_{34} + r_{13} r_{14} r_{23} r_{24}) \right\} (1 - r_{23}^2)$$

$$> (r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23})\,(1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2 r_{23} r_{24} r_{34}).$$

This may be finally reduced to

$$(r_{14} - r_{13} r_{34} - r_{12} r_{24} - r_{14} r_{23}^2 + r_{13} r_{23} r_{24} + r_{12} r_{23} r_{34})^2 > 0,$$

that is, $\rho_{14}^2 > 0$.

The treatment of the general case of n variables, so far as regards 
obtaining the regressions, is obvious, and it is unnecessary to give it 
at length. 

We can now see that the use of normal regression formulas is quite
legitimate in all cases, so long as the necessary limitations of
interpretation are recognised. Bravais' r always remains a coefficient of
correlation. These results I must plead as justification for my use of
normal formulas in two cases* where the correlation was markedly
non-normal.



" Mathematical Contributions to the Theory of Evolution. — On 
a Form of Spurious Correlation which may arise when 
Indices are used in the Measurement of Organs." By 
Karl Pearson, F.R.S., University College, London. Received
December 29, 1896. Read February 18, 1897.

(1) If the ratio of two absolute measurements on the same or
different organs be taken it is convenient to term this ratio an index.

If $u = f_1(x, y)$ and $v = f_2(z, y)$ be two functions of the three variables
$x$, $y$, $z$, and these variables be selected at random so that there exists
no correlation between $x, y$; $y, z$; or $z, x$, there will still be found to
* 'Economic Journal,' Dec., 1895, and Dec., 1896, "On the Correlation of Total
Pauperism with Proportion of Out-relief."




exist correlation between $u$ and $v$. Thus a real danger arises when a
statistical biologist attributes the correlation between two functions,
like $u$ and $v$, to organic relationship. The particular case that is
likely to occur is when $u$ and $v$ are indices with the same denominator,
for the correlation of indices seems at first sight a very plausible
measure of organic correlation.

The difficulty and danger which arise from the use of indices was 
brought home to me recently in an endeavour to deal with a consider- 
able series of personal equation data. In this case it was convenient 
to divide the errors made by three observers in estimating a variable 
quantity by the actual value of the quantity. As a result there 
appeared a high degree of correlation between three series of abso- 
lutely independent judgments. It was some time before I realised 
that this correlation had nothing to do with the manner of judging, 
but was a special case of the above principle due to the use of indices. 

A further illustration is of the following kind. Select three numbers
within certain ranges at random, say $x$, $y$, $z$; these will be pair
and pair uncorrelated. Form the proper fractions $x/y$ and $z/y$ for
each triplet, and correlation will be found between these indices.
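The illustration of random triplets can be tried numerically. The following is only a minimal sketch: the uniform ranges, the sample size, the seed, and the helper names `pearson_r` and `random_triplet_demo` are assumptions made for the demonstration and are not taken from the paper.

```python
import math
import random


def pearson_r(a, b):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def random_triplet_demo(n=20000, seed=1897):
    """Draw independent x, y, z; return r(x, z) and r(x/y, z/y)."""
    rng = random.Random(seed)
    # Assumed ranges: y always exceeds x and z, so x/y and z/y are proper fractions.
    x = [rng.uniform(1.0, 2.0) for _ in range(n)]
    z = [rng.uniform(1.0, 2.0) for _ in range(n)]
    y = [rng.uniform(2.0, 3.0) for _ in range(n)]
    r_raw = pearson_r(x, z)
    r_index = pearson_r([p / q for p, q in zip(x, y)],
                        [p / q for p, q in zip(z, y)])
    return r_raw, r_index


r_raw, r_index = random_triplet_demo()
print(f"r(x, z)     = {r_raw:+.3f}")    # numerators alone: close to zero
print(f"r(x/y, z/y) = {r_index:+.3f}")  # indices with a common denominator: clearly positive
```

With these assumed ranges the numerators stay practically uncorrelated, while the two fractions show a distinctly positive correlation that comes only from the shared denominator.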

The application of this idea to biology seems of considerable 
importance. For example, a quantity of bones are taken from an 
ossuarium, and are put together in groups, which are asserted to be 
those of individual skeletons. To test this a biologist takes the 
triplet femur, tibia, humerus, and seeks the correlation between the 
indices femur/humerus and tibia/humerus. He might reasonably
conclude that this correlation marked organic relationship, and
believe that the bones had really been put together substantially in
their individual grouping. As a matter of fact, since the coefficients
of variation for femur, tibia, and humerus are approximately equal,
there would be, as we shall see later, a correlation of about 0.4 to
0.5 between these indices had the bones been sorted absolutely at
random. I term this a spurious organic correlation, or simply a 
spurious correlation. I understand by this phrase the amount of 
correlation which would still exist between the indices, were the 
absolute lengths on which they depend distributed at random. 

It has hitherto been usual to measure the organic correlation of the
organs of shrimps, prawns, crabs, &c., by the correlation of indices in
which the denominator represents the total body length or total
carapace length. Now suppose a table formed of the absolute lengths
and the indices of, say, some thousand individuals. Let an "imp"
(allied to the Maxwellian demon) redistribute the indices at random;
they would then exhibit no correlation; if the corresponding absolute
lengths followed along with the indices in the redistribution, they
also would exhibit no correlation. Now let us suppose the indices
not to have been calculated, but the imp to redistribute the absolute
lengths; these would now exhibit no organic correlation, but
the indices calculated from this random distribution would have a
correlation nearly as high, if not in some cases higher than before.
The biologist would be not unlikely to argue that the index correlation
of the imp-assorted, but probably, from the vital standpoint,
impossible beings was "organic."

As a last illustration, suppose 1000 skeletons obtained by distributing
component bones at random. Between none of their bones will
these individuals exhibit correlation. Wire the spurious skeletons
together and photograph them all, so that their stature in the
photographs is the same; the series of photographs, if measured, will show
correlation between their parts. It seems to me that the biologist
who reduces the parts of an animal to fractions of some one length
measured upon it is dealing with a series very much like these
photographs. A part of the correlation he discovers between organs is
undoubtedly organic, but another part is solely due to the nature of
his arithmetic, and as a measure of organic relationship is spurious.

Returning to our problem of the randomly distributed bones, let
us suppose the indices femur/humerus and tibia/humerus to have a
correlation of 0.45. Now suppose successively 1, 2, 3, 4, &c.,
per cent. of the bones are assorted in their true groupings, then
begins the true organic correlating of the bones. It starts from 0.45,
and will alter gradually until 100 per cent. of the bones are truly
grouped. The final value may be greater or less than 0.45, but it
would seem that 0.45 is a more correct point to measure the organic
correlation from than zero. At any rate it appears fairly certain
that if a biologist recognised that a perfectly random selection of
organs would still lead to a correlation of organ-indices, he would
be unlikely to accept index-correlation as a fair measure of the
relative intensity of correlation between organs. I shall accordingly
define spurious organic correlation as the correlation which will be
found between indices, when the absolute values of the organs have
been selected purely at random. In estimating relative correlation
by the hitherto usual measurement of indices, it seems to me that a
statement of the amount of spurious correlation ought always to be
made.

(2) Proposition I. To find the mean of an index in terms of the
means, coefficients of variation, and coefficient of correlation of the two
absolute measurements.*

Let $x_1$, $x_2$, $x_3$, $x_4$ be the absolute sizes of any four correlated organs;
$m_1$, $m_2$, $m_3$, $m_4$ their mean values; $\sigma_1$, $\sigma_2$, $\sigma_3$, $\sigma_4$ their standard deviations;
$v_1$, $v_2$, $v_3$, $v_4$ their coefficients of variation, i.e., $\sigma_1/m_1$, $\sigma_2/m_2$, $\sigma_3/m_3$,
$\sigma_4/m_4$ respectively; $r_{12}$, $r_{23}$, $r_{34}$, $r_{41}$, $r_{24}$, $r_{13}$ the six coefficients of
correlation; $\epsilon_1$, $\epsilon_2$, $\epsilon_3$, $\epsilon_4$ the deviations of the four organs from their means,
i.e., $x_1 = m_1 + \epsilon_1$, $x_2 = m_2 + \epsilon_2$, $x_3 = m_3 + \epsilon_3$, $x_4 = m_4 + \epsilon_4$; $i_{13}$ the mean
value of the index $x_1/x_3$ and $i_{24}$ the mean value of $x_2/x_4$; $\Sigma_{13}$, $\Sigma_{24}$ the
standard deviations of the indices $x_1/x_3$ and $x_2/x_4$ respectively; and $n$
the total number of groups of organs.

* In all that follows, unless otherwise stated, the correlation may be of any kind
whatever, i.e., the frequencies are not supposed to follow the Gaussian or normal
law of error.

We shall suppose the ratios of the deviations to the mean absolute 
values of the organs are so small that their cubes may be neglected. 

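Then

$$i_{13} = \frac{1}{n} S\left(\frac{x_1}{x_3}\right) = \frac{m_1}{n m_3} S\left[\left(1 + \frac{\epsilon_1}{m_1}\right)\left(1 + \frac{\epsilon_3}{m_3}\right)^{-1}\right]$$

$$= \frac{m_1}{n m_3}\left(n + \frac{S(\epsilon_1)}{m_1} - \frac{S(\epsilon_3)}{m_3} - \frac{S(\epsilon_1 \epsilon_3)}{m_1 m_3} + \frac{S(\epsilon_3^2)}{m_3^2}\right),$$

if we neglect quantities of the third order in $\epsilon/m$. But $S(\epsilon_1) =
S(\epsilon_3) = 0$, $S(\epsilon_1 \epsilon_3) = n r_{13} \sigma_1 \sigma_3$, and $S(\epsilon_3^2) = n \sigma_3^2$.

Hence

$$i_{13} = \frac{m_1}{m_3}\,(1 + v_3^2 - r_{13} v_1 v_3). \qquad \text{(i)}$$

Similarly

$$i_{24} = \frac{m_2}{m_4}\,(1 + v_4^2 - r_{24} v_2 v_4). \qquad \text{(ii)}$$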

Thus we see that the mean of an index is not the ratio of the means 
of the corresponding absolute measurements, but differs by a quan- 
tity depending on the correlation and variation coefficients of the 
absolute measurements. 
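The statement that the mean of an index is not the ratio of the means can be checked by simulation. The sketch below is only an illustration under assumptions of my own: two independent normally distributed organs (so that the correlation term drops out), coefficients of variation of 5 per cent., and a fixed seed. The function names are hypothetical. The approximation used is the second-order one of this Proposition, mean index = (m1/m3)(1 + v3^2 - r13 v1 v3).

```python
import random


def index_mean_approx(m1, m3, v1, v3, r13):
    """Second-order approximation to the mean of x1/x3.

    v1 and v3 are coefficients of variation as fractions (not per cent).
    """
    return (m1 / m3) * (1.0 + v3 ** 2 - r13 * v1 * v3)


def simulated_index_mean(m1, s1, m3, s3, n=200000, seed=1):
    """Monte Carlo mean of x1/x3 for independent normal x1 and x3 (r13 = 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += rng.gauss(m1, s1) / rng.gauss(m3, s3)
    return total / n


# Assumed illustrative organs: means 50 and 100, both with v = 0.05.
approx = index_mean_approx(50.0, 100.0, 0.05, 0.05, 0.0)
simulated = simulated_index_mean(50.0, 2.5, 100.0, 5.0)
print(f"ratio of means       = {50.0 / 100.0:.5f}")
print(f"approximate formula  = {approx:.5f}")
print(f"simulated mean index = {simulated:.5f}")
```

The simulated mean should sit slightly above the plain ratio of the means and close to the approximate value, the excess coming from the variation of the denominator.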

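(3) Proposition II. To find the standard deviation of an index in
terms of the coefficients of variation, and coefficient of correlation of
the two absolute measurements.

$$\Sigma_{13}^2 = \frac{1}{n} S\left(\frac{x_1}{x_3} - i_{13}\right)^2 = \frac{m_1^2}{n m_3^2}\left\{ S\left(\frac{\epsilon_1}{m_1} - \frac{\epsilon_3}{m_3}\right)^2 + \text{cubic terms} \right\}$$

$$= i_{13}^2\,(v_1^2 + v_3^2 - 2 r_{13} v_1 v_3),$$

if we neglect cubic terms.

$$\therefore\ \Sigma_{13} = i_{13} \sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}. \qquad \text{(iii)}$$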
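As a small numerical sketch of this Proposition, the function below evaluates the standard deviation of an index as the mean index multiplied by the square root of v1^2 + v3^2 - 2 r13 v1 v3. The function name is hypothetical. The inputs are the breadth and length figures for the skulls tabulated later in the paper (Example (c)); since the paper deduces r13 there from the same index variation, the agreement is a consistency check rather than an independent test.

```python
import math


def index_sd(mean_index, v1, v3, r13):
    """Standard deviation of the index x1/x3.

    mean_index: mean value of the index
    v1, v3: coefficients of variation as fractions (not per cent)
    r13: correlation between numerator and denominator
    """
    return mean_index * math.sqrt(v1 ** 2 + v3 ** 2 - 2.0 * r13 * v1 * v3)


# Breadth/length index of the skulls: percentages are divided by 100,
# as the paper's footnote on units requires.
sigma_bl = index_sd(83.41, 3.8871 / 100.0, 3.2363 / 100.0, 0.2849)
print(f"standard deviation of B/L = {sigma_bl:.4f}")
```

The result should agree closely with the value 3.5794 tabulated for this index.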




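(4) Proposition III. To find the coefficient of correlation of two indices
in terms of the coefficients of correlation of the four absolute measurements
and their coefficients of variation.

Let $x_1/x_3$ and $x_2/x_4$ be the two indices, and $\rho$ their coefficient of
correlation. Then

$$n \rho\, \Sigma_{13} \Sigma_{24} = S\left\{\left(\frac{x_1}{x_3} - i_{13}\right)\left(\frac{x_2}{x_4} - i_{24}\right)\right\} = \frac{m_1 m_2}{m_3 m_4}\, S\left\{\left(\frac{\epsilon_1}{m_1} - \frac{\epsilon_3}{m_3}\right)\left(\frac{\epsilon_2}{m_2} - \frac{\epsilon_4}{m_4}\right)\right\},$$

if we neglect terms of the cubic order.

Hence, finally,

$$\rho = \frac{r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\ \sqrt{v_2^2 + v_4^2 - 2 r_{24} v_2 v_4}}. \qquad \text{(iv)}$$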

(5) Thus we have expressed $\rho$ in terms of the four coefficients of
correlation and the four coefficients of variation of the absolute
measurements which form the indices.
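The expression of Proposition III lends itself to a short function. What follows is a sketch only: the dictionary layout and the name `index_correlation` are my own choices, and the formula coded is the approximate one (cubic terms neglected), with numerator r12 v1 v2 - r14 v1 v4 - r23 v2 v3 + r34 v3 v4 and the two index variations in the denominator.

```python
import math


def index_correlation(v, r):
    """Approximate correlation between the indices x1/x3 and x2/x4.

    v: coefficients of variation, keyed 1..4
    r: correlations keyed by sorted organ pairs, e.g. r[(1, 2)]
    """
    def rr(a, b):
        return r[(min(a, b), max(a, b))]

    num = (rr(1, 2) * v[1] * v[2] - rr(1, 4) * v[1] * v[4]
           - rr(2, 3) * v[2] * v[3] + rr(3, 4) * v[3] * v[4])
    den = (math.sqrt(v[1] ** 2 + v[3] ** 2 - 2.0 * rr(1, 3) * v[1] * v[3])
           * math.sqrt(v[2] ** 2 + v[4] ** 2 - 2.0 * rr(2, 4) * v[2] * v[4]))
    return num / den


pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
equal_v = {k: 0.04 for k in (1, 2, 3, 4)}  # assumed equal variations

# Four quite uncorrelated organs: the index correlation vanishes.
uncorrelated = {p: 0.0 for p in pairs}
print(index_correlation(equal_v, uncorrelated))

# Third and fourth organs made identical (r34 = 1), the rest uncorrelated.
shared_denominator = dict(uncorrelated)
shared_denominator[(3, 4)] = 1.0
print(index_correlation(equal_v, shared_denominator))
```

The second call treats the two denominators as one and the same organ, which is the case of indices with a common denominator; with equal coefficients of variation it should return one-half.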

We may draw the following conclusions : 

(i.) The correlation between two indices will always vanish when
the four absolute measurements forming the indices are quite
uncorrelated.

(ii.) If two of the organs are perfectly correlated, let us say made
identical: for example, the third and fourth, so that $r_{34} = 1$ and
$v_3 = v_4$, we find

$$\rho = \frac{r_{12} v_1 v_2 - r_{13} v_1 v_3 - r_{23} v_2 v_3 + v_3^2}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\ \sqrt{v_2^2 + v_3^2 - 2 r_{23} v_2 v_3}}. \qquad \text{(v)}$$

This is the coefficient of correlation between two indices with the
same denominator ($x_1/x_3$ and $x_2/x_3$).

The value of $\rho$ in (v) does not vanish if the remaining organs be
quite uncorrelated, i.e., $r_{12} = r_{13} = r_{23} = 0$. In this case

$$\rho_0 = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2}\ \sqrt{v_2^2 + v_3^2}}. \qquad \text{(vi)}$$

This is the measure of the spurious correlation. For the special
case in which the coefficients of variation are all the same, $\rho_0 = 0.5$.
When the absolute sizes of organs are very feebly correlated, then
in most cases there will be a considerable correlation of indices.
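The measure of spurious correlation can be sketched in a few lines. The function name is hypothetical, and the formula coded is the one obtained by putting all three correlations equal to zero in the common-denominator case, namely v3^2 divided by the product of the square roots of (v1^2 + v3^2) and (v2^2 + v3^2). The numerical inputs are arbitrary illustrative values, not data from the paper.

```python
import math


def spurious_correlation(v1, v2, v3):
    """Index correlation of x1/x3 and x2/x3 when x1, x2, x3 are uncorrelated.

    v1, v2: coefficients of variation of the numerators
    v3: coefficient of variation of the common denominator
    """
    return v3 ** 2 / math.sqrt((v1 ** 2 + v3 ** 2) * (v2 ** 2 + v3 ** 2))


# Equal coefficients of variation give one-half, whatever their common size.
print(spurious_correlation(0.04, 0.04, 0.04))

# A steadier denominator gives less spurious correlation, a more variable one more.
for v3 in (0.0, 0.02, 0.04, 0.08):
    print(v3, round(spurious_correlation(0.04, 0.04, v3), 4))
```

Only the relative sizes of the three coefficients of variation matter, so they may be entered either as fractions or as percentages.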

Example (a). Suppose three organs, $x_1$, $x_2$, and $x_3$, to have sensibly
equal coefficients of variation, and that the correlation of $x_1$ and $x_2$ is
$r_{12} = r$, and that of $x_1$ and $x_3$, as well as of $x_2$ and $x_3$, is $r'$.

Then:

$$\rho = 0.5 \times \frac{1 + r - 2r'}{1 - r'} = 0.5 + 0.5 \times \frac{r - r'}{1 - r'}.$$



This formula illustrates well in a specially simple case how the
correlation in the indices diverges from the spurious value 0.5, as we
alter $r$ and $r'$ from zero, i.e., as we introduce organic correlation.
According as $r$, the correlation of the numerators, is greater or less
than $r'$, the correlation of the numerator with the denominator, the
actual index correlation can be greater or less than the spurious
value.
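The behaviour described for Example (a) may be tabulated with a short sketch. The function name and the chosen pairs of values are assumptions for illustration; the formula is the equal-variation one, index correlation = 0.5 + 0.5 (r - r')/(1 - r'), which is the same quantity as 0.5 (1 + r - 2r')/(1 - r').

```python
def equal_cv_index_correlation(r, r_prime):
    """Index correlation when all three coefficients of variation are equal.

    r: correlation between the two numerators
    r_prime: correlation of each numerator with the common denominator
    """
    return 0.5 + 0.5 * (r - r_prime) / (1.0 - r_prime)


# Illustrative (r, r') pairs: no organic correlation, r > r', r < r', r = r'.
for r, r_prime in [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.8, 0.8)]:
    rho = equal_cv_index_correlation(r, r_prime)
    print(f"r = {r:.1f}, r' = {r_prime:.1f}  ->  index correlation = {rho:.3f}")
```

The index correlation stays at the spurious value whenever r equals r', rises above it when the numerators are the more closely correlated pair, and falls below it in the opposite case.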

Example (b). If $z_1$, $z_2$ be the indices, then in the case of normal
correlation the contour lines of the correlation surface for the indices
are given by

$$\frac{z_1^2}{\Sigma_1^2 (1 - \rho^2)} - \frac{2 \rho z_1 z_2}{\Sigma_1 \Sigma_2 (1 - \rho^2)} + \frac{z_2^2}{\Sigma_2^2 (1 - \rho^2)} = \text{constant},$$

where $\Sigma_1$, $\Sigma_2$, and $\rho$ are given by (iii) and (iv) above.

The contour lines of a surface of spurious index correlation are
given by

$$\frac{z_1^2}{i_{13}^2}\,(v_2^2 + v_3^2) - 2\,\frac{z_1 z_2}{i_{13} i_{23}}\, v_3^2 + \frac{z_2^2}{i_{23}^2}\,(v_1^2 + v_3^2) = \text{constant},$$

while the uncorrelated distribution of the numerators $x_1$ and $x_2$ is
given by the contours,

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = \text{constant}.$$

We are thus able to mark the growth of the spurious correlation
as we increase $v_3$ from zero; we see the axes of the ellipses diminishing
and their directions beginning to rotate.

Example (c). To find the spurious correlation between the two chief
cephalic indices.

I have calculated the following results from the measurements
made on 100 "Altbayerisch" ♂ skulls, by Professor J. Ranke. See
his 'Anthropologie der Bayern,' Bd. i, Kapitel v, S. 194.

Breadth of skull:* $m_1 = 150.47$, $\sigma_1 = 5.8488$, $v_1 = 3.8871$.

Height of skull: $m_2 = 133.78$, $\sigma_2 = 4.6761$, $v_2 = 3.4954$.

Length of skull: $m_3 = 180.58$, $\sigma_3 = 5.8441$, $v_3 = 3.2363$.

Cephalic index, B/L: $i_{13} = 83.41$, $\Sigma_{13} = 3.5794$, $V_{13} = 4.2913$.

Cephalic index, H/L: $i_{23} = 74.23$, $\Sigma_{23} = 3.6305$, $V_{23} = 4.8909$.

Cephalic index, H/B: $i_{21} = 89.12$, $\Sigma_{21} = 4.1752$, $V_{21} = 4.6849$.

The coefficients of correlation may at once be deduced:

Breadth and length: $r_{13} = (v_1^2 + v_3^2 - V_{13}^2)/(2 v_1 v_3) = 0.2849$.
Height and length: $r_{23} = (v_2^2 + v_3^2 - V_{23}^2)/(2 v_2 v_3) = -0.0543$.
Height and breadth: $r_{12} = (v_2^2 + v_1^2 - V_{21}^2)/(2 v_1 v_2) = 0.1243$.

This is the first table, so far as I am aware, that has been published
of the variation and correlation of the three chief cephalic lengths.†
It shows us that there is not at all a close correlation between these 
chief dimensions of the skull, and that a small compensating factor 
for size is to be sought in the correlation of height and length, i.e., 
while a broad skull is probably a long skull and also a high skull, a 
high skull will probably be a short skull, and a low skull a long skull. 

Without substituting the values of $v_1$, $v_2$, $v_3$, $r_{12}$, $r_{13}$, $r_{23}$ in
(v), we can find $\rho$, or the correlation between breadth/length and
height/length indices from:

$$\rho = (V_{13}^2 + V_{23}^2 - V_{12}^2)/(2 V_{13} V_{23}).$$

This follows at once from the general theorem given in my memoir
on "Regression, Panmixia, and Heredity," 'Phil. Trans.,' vol. 187,
A, p. 279, or by substitution of the above values of $r_{12}$, $r_{13}$, $r_{23}$ in (v),
we find

$$\rho = 0.4857.$$
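The value just given can be recomputed from the three tabulated index variations. The sketch below is a check under stated assumptions: the function name is hypothetical, and the inputs are the percentage variations of the B/L, H/L and H/B indices listed in the table of this example. It evaluates the expression (V13^2 + V23^2 - V12^2)/(2 V13 V23).

```python
def index_correlation_from_variations(v_13, v_23, v_12):
    """Correlation of x1/x3 with x2/x3 from the variations of three indices.

    v_13, v_23: coefficients of variation of the two indices with denominator x3
    v_12: coefficient of variation of the index formed by the two numerators
    """
    return (v_13 ** 2 + v_23 ** 2 - v_12 ** 2) / (2.0 * v_13 * v_23)


# Tabulated percentage variations of the cephalic indices B/L, H/L and H/B.
rho_skull = index_correlation_from_variations(4.2913, 4.8909, 4.6849)
print(f"correlation of B/L with H/L = {rho_skull:.4f}")
```

Because the expression is a ratio of like quantities, the percentages need not be divided by 100 here.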

If we calculate from (vi) the correlation between the same cephalic
indices on the hypothesis that their heights, breadths and lengths
are distributed at random, i.e., that our "Imp" has constructed a
number of arbitrary and spurious skulls from Professor Ranke's
measurements, we find:

$$\rho_0 = 0.4008.$$

It seems to me that a quite erroneous impression would be formed
of the organic correlation of the human skull, did we judge it by the
magnitude of the correlation coefficient (0.4857) for the two chief
cephalic indices, for no less than 0.4008 of this would remain, if we
destroyed all organic relationship between the lengths on which these
indices are based.

* All the absolute measures given are in millimetres, and the coefficients of
variation are percentage variations, i.e., they must be divided by 100 before being
used in formulae (i), (ii), and (iii).

† I hope later to treat correlation in man with reference to race, sex, and
organ, as I have treated variation.

Example (d). To find the spurious correlation between the indices
femur/humerus and femur/tibia.

The following results have been calculated* from measurements
made by Koganei on Aino skeletons. (See 'Mittheilungen aus der
medicinischen Facultät der K. J. Universität, Tokio,' Bd. I. Tables.)

I have kept the sexes apart although there are but few of each.

♂ Skeletons. Number = 40 to 44. Measurements in centimetres.

Femur, F: $m_1 = 40.845$, $\sigma_1 = 1.957$, $v_1 = 4.792$.

Tibia, T: $m_2 = 31.740$, $\sigma_2 = 1.577$, $v_2 = 4.970$.

Humerus, H: $m_3 = 29.593$, $\sigma_3 = 1.337$, $v_3 = 4.517$.

The following coefficients of correlation were calculated directly:

Femur and tibia: $r_{12} = 0.8266$.
Femur and humerus: $r_{13} = 0.8585$.
Tibia and humerus: $r_{23} = 0.7447$.

From these were deduced by the formulæ of this paper†:

Index, F/T: $i_{12} = 128.75$, $\Sigma_{12} = 3.7075$, $V_{12} = 2.8795$.
Index, F/H: $i_{13} = 137.92$, $\Sigma_{13} = 3.4084$, $V_{13} = 2.4714$.
Index, T/H: $i_{23} = 107.02$, $\Sigma_{23} = 3.6675$, $V_{23} = 3.4271$.

Hence we find for the correlation of the indices F/H and T/H:

$$\rho = 0.5644.$$

But the spurious correlation, if the bones had been grouped at
random, would have been

$$\rho_0 = 0.4557.$$



* I have to thank Miss Alice Lee for a considerable part of the arithmetic work
of this example.

† The values for the indices are not in absolute agreement with those to be
deduced from the lengths, for it was not always possible to use the same skeleton
for femur and humerus as for tibia and humerus, i.e., sometimes one or other bone
was missing. For the same reason, the constants for the absolute lengths do not
agree entirely with those given for Ainos in my paper on "Variation in Man and
Woman" (in 'The Chances of Death and other Studies in Evolution,' vol. 1,
p. 303), for the simple reason that I there used every available bone, and not every
available pair, as here.




Tabulating the corresponding quantities for the other sex, we
find

♀ Skeletons. Number = 22 to 24. Measurements in centimetres.

Femur, F: $m_1 = 38.075$, $\sigma_1 = 1.494$, $v_1 = 3.924$.

Tibia, T: $m_2 = 29.800$, $\sigma_2 = 1.576$, $v_2 = 5.289$.

Humerus, H: $m_3 = 27.565$, $\sigma_3 = 1.109$, $v_3 = 4.022$.

Femur and tibia: $r_{12} = 0.8457$.
Femur and humerus: $r_{13} = 0.8922$.
Tibia and humerus: $r_{23} = 0.7277$.

Index, F/T: $i_{12} = 127.90$, $\Sigma_{12} = 3.8937$, $V_{12} = 3.0444$.

Index, F/H: $i_{13} = 138.37$, $\Sigma_{13} = 2.6930$, $V_{13} = 1.9462$.

Index, T/H: $i_{23} = 108.36$, $\Sigma_{23} = 4.1022$, $V_{23} = 3.7857$.

$$\rho = 0.6006.$$
$$\rho_0 = 0.3904.$$
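The index correlations for the two sexes can be recomputed from the tabulated variations of the three bone indices. This is a sketch under assumptions: the function and dictionary names are hypothetical, and the expression used is the one given for the cephalic indices, (V13^2 + V23^2 - V12^2)/(2 V13 V23), applied here to F/H, T/H and F/T. The paper does not say in so many words that the bone values were obtained this way.

```python
def index_correlation_from_variations(v_13, v_23, v_12):
    """Correlation of x1/x3 with x2/x3 from the variations of three indices."""
    return (v_13 ** 2 + v_23 ** 2 - v_12 ** 2) / (2.0 * v_13 * v_23)


# Percentage variations of the indices F/H, T/H and F/T as tabulated for each sex.
tabulated = {
    "male": (2.4714, 3.4271, 2.8795),
    "female": (1.9462, 3.7857, 3.0444),
}

rho_by_sex = {
    sex: index_correlation_from_variations(*variations)
    for sex, variations in tabulated.items()
}
for sex, rho in rho_by_sex.items():
    print(f"{sex}: correlation of F/H with T/H = {rho:.4f}")
```

Both results should agree closely with the index correlations printed in the tables above.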

Hence we may conclude as follows : 

(i) The absolute lengths of the long bones differ from those of the 
skull in being very closely correlated. 

(ii) The use of indices for the long bones would appear to mini- 
mise, rather than, as in the case of the skull, to exaggerate this 
correlation. 

(iii) If we measure, however, organic correlation of the indices by
$\rho - \rho_0$, we shall find index correlation less than absolute length
correlation for both long bones and skull, and in both cases the former
comparatively small as compared with the latter.

(iv) The results for the 24 female skeletons, although based on 
but few data, serve on the whole to confirm the male results.* 

(6.) From the above examples it will be seen that the method,
which judges of the intensity of organic correlation by the reduction
of all absolute measures to indices, the denominators of which are
some one absolute measurement, is not free from obscurity; for this
method would give the major portion of the observed index correlation
had the parts of the animal been thrown together entirely at
random, i.e., if there were no organic correlation at all. The following
additional remarks may be of interest. The results (iv) to (vi)
show us that the correlation coefficients of indices are functions, not
only of the correlation coefficients of absolute measurements, but also
of the coefficients of variation of the latter measurements. Hence,
unless the coefficients of variation be constant for local races, it is
impossible that the coefficients of correlation can be constant for
indices. In other words, the hypothesis of the constancy for local
races of correlation, and that of the constancy for local races of
variation, stand on exactly the same footing.

* The fact that the male is more variable in height-sitting, in femur, and in
tibia than the female, while she appears to be more variable than he is in stature,
led me to prophesy, in my paper on "Variation in Man and Woman," that the
female would be found to be more closely correlated in the bones forming stature
than the male. This appears to be the case for the femur and tibia of Ainos.

The conclusions of this paper although applied to organic correla- 
tion are equally valid so far as concerns the use of indices in judging 
the correlation of either physical or economic phenomena. It was, 
indeed, a difficulty arising from my discussion of personal judgments 
— a spurious correlation between the judgments of different observers 
— which first drew my attention to the matter. 

Note, January 13, 1897. The result described by Professor
Pearson evidently affects the value of the correlation coefficients
determined by me in Crangon and Carcinus ('Roy. Soc. Proc.,' vols.
51 and 54), because I have always expressed the size of the organs
measured in terms of body length.

In order to show the effect of this, I have lately performed, at
Professor Pearson's suggestion, the following experiment: It happens
that my measures of Plymouth shrimps are recorded in a book, in
the order in which they were measured, and therefore at random as
regards carapace length or other characters. I constructed from
these records 420 "spurious" shrimps, in the following way: the
total length of the first shrimp in the book was associated with the
carapace length of the tenth shrimp and the "post-spinous length"
of the twentieth, and so throughout. Evidently these three measures
were associated at random, and we might expect that these spurious
shrimps would show no organic correlation; but when the carapace
lengths and "post-spinous lengths" of these spurious shrimps
were divided by the body length, and the correlation between the
resulting indices was determined, the value of r was found to be 0.38,
the value for real shrimps being 0.81, or the correlation due to the
use of indices forms 47 per cent. of the observed value.

W. F. R. Weldon.
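The recombination described in the note can be imitated on invented data. Everything numerical in the sketch below is an assumption: the 420 records are synthetic, the means and variations are arbitrary, the wrap-around at the end of the book is my own device, and the function names are hypothetical. It does not reproduce the shrimp measurements or the values 0.38 and 0.81; it only shows the procedure of pairing the total length of one record with the carapace length of the tenth and the post-spinous length of the twentieth.

```python
import math
import random


def pearson_r(a, b):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def synthetic_records(n=420, seed=1897):
    """Invented (total, carapace, post-spinous) lengths; not real measurements."""
    rng = random.Random(seed)
    total, carapace, post_spinous = [], [], []
    for _ in range(n):
        length = rng.gauss(60.0, 3.0)
        total.append(length)
        carapace.append(0.25 * length * (1.0 + rng.gauss(0.0, 0.02)))
        post_spinous.append(0.18 * length * (1.0 + rng.gauss(0.0, 0.02)))
    return total, carapace, post_spinous


def spurious_shrimps(total, carapace, post_spinous):
    """Pair record i with the carapace of record i+9 and post-spinous of i+19."""
    n = len(total)
    c = [carapace[(i + 9) % n] for i in range(n)]       # tenth shrimp for the first
    p = [post_spinous[(i + 19) % n] for i in range(n)]  # twentieth shrimp for the first
    return total, c, p


total, c, p = spurious_shrimps(*synthetic_records())
r_absolute = pearson_r(c, p)
r_indices = pearson_r([a / t for a, t in zip(c, total)],
                      [b / t for b, t in zip(p, total)])
print(f"recombined absolute lengths: r = {r_absolute:+.3f}")
print(f"recombined indices:          r = {r_indices:+.3f}")
```

On such recombined records the absolute lengths should show little or no correlation, while the indices formed on the common body length remain distinctly correlated.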



"Note to the Memoir by Professor Karl Pearson, F.R.S., on
Spurious Correlation." By Francis Galton, F.R.S. Received
January 4. Read February 18, 1897.

I send this note to serve as a kind of appendix to the memoir of 
Professor K. Pearson, believing that it may be useful in enabling 
others to realise the genesis of spurious correlation. It is important 
though rather difficult to do so, because the results arrived at in the 
memoir, which are of serious interest to practical statisticians, have 
at first sight a somewhat paradoxical appearance.